ABSTRACT



An automated means for defining a filter used to extract web 
content for a web page is disclosed wherein the extracted 
content is used in a recast web page. The recast web page 
may be produced by a hosting site, or may be part of an effort 
to revise a web site at a web content provider. First, a set of
pages, possibly a single page, is retrieved from a content
provider web server. Next, the web page is parsed to identify
a set of selectable content elements. Next, a representation
of the original web page is presented in a user interface,
wherein the selectable content elements are demarcated. The
user will select some of the elements for inclusion in the
filter through the user interface, whereby the tool will
indicate the selected content elements for inclusion in the
filter. The tool constructs the filter so that when the filter is
used, the selected content elements are extracted from a 
retrieved web page from the content provider web server and 
reused in the recast web page. As part of the process of 
identifying the selectable content elements, a set of varied 
headers can be used to retrieve multiple versions of the same 
web page. In this way, the multiple versions of the web page 
are compared to identify static and dynamic content ele- 
ments and marked as static or dynamic. 

27 Claims, 14 Drawing Sheets 
FILTER DEFINITION FOR DISTRIBUTION
MECHANISM FOR FILTERING,
FORMATTING AND REUSE OF WEB BASED
CONTENT

BACKGROUND OF THE INVENTION

The present invention relates generally to data processing
systems. More particularly, it relates to managing and
formatting electronically-published material distributed over
a computer network.

The World Wide Web is the Internet's multimedia information
retrieval system. In the Web environment, client machines
effect transactions to Web servers using the Hypertext
Transfer Protocol (HTTP), which is a known application
protocol providing users access to files (e.g., text, graphics,
images, sound, video, etc.) using a standard page description
language known as Hypertext Markup Language (HTML).
HTML provides basic document formatting and allows the
developer to specify "links" to other servers and files. In the
Internet paradigm, a network path to a server is identified by
a so-called Uniform Resource Locator (URL) having a
special syntax for defining a network connection. Use of an
HTML-compatible browser (e.g., Netscape Navigator or
Microsoft Internet Explorer) at a client machine involves
specification of a link via the URL. In response, the client
makes a request to the server (sometimes referred to as a
"Web site") identified in the link and, in return, receives a
document or other object formatted according to HTML.

Among the many challenges in running a successful web
site is the constant creation and updating of the web pages and
other files, i.e. web content, to keep the site fresh and new
and attractive to web users. Web sites which do not update
their content on a regular basis tend to lose their favor.
Eventually, fewer "hits" are logged on the web site's pages
as fewer users view the information or advertisements which
the web site is publishing. As web based advertising fees are
typically based on the number of hits a page or site receives,
this reduction will directly and adversely affect the revenues
of the web site. Of course, the constant update of the web
content, while necessary to maintain the popularity of the
site, is very expensive in terms of manpower and time.

Furthermore, much of the information on a particular web
site is redundant when compared to information available on
other similar sites. Some of this duplicate information
represents differences in opinion and is no doubt the sign of a
tolerant and free society. However, much of the information
is simply a duplication of the same news on each web site.
From the perspective of the web site content provider, it
would be efficient if some of the information found on other
sites could be reused or "hosted" on his site. Thus, additional
manpower for writing and entering articles on the web
server can be reduced or eliminated. Of course, such reuse
is subject to the copyright laws and must be the subject of
an agreement with the content provider of the source
material.

While Web-based content exists in abundance, it is not
necessarily easy to persuade a web content provider to share
content on a low or no charge basis. This is especially true
for Web-based news articles, as these news articles typically
represent the major revenue generating content for the
publisher by carrying advertising banners above and/or
below the article text. Therefore, the web publishers are apt
to charge a large amount for licensing the content to other
sites for reprinting. Each reprint represents a loss of revenue
under the standard arrangement of exporting the content in
raw format to the licensing host and that host posting the
articles on their own site without the publisher's
advertisements.

Further, even if a web site operator could find a content
provider willing to share their content at economically
favorable terms, other problems exist. A single content
provider may not be likely to provide the complete gamut of
articles which the hosting web site would like to serve to its
web clients. It would be preferable that the hosting site be
[illegible] content from a variety of potential content
[illegible]. Again, the likelihood of finding many
[illegible] content providers is even lower. Yet
[illegible] accomplished, as each site has its own
[illegible] web sites, the hosting
[illegible] a disjointed hodgepodge collection of
material. It is hardly the professional image that the hosting
[illegible].

It is unlikely that a web content provider who is
essentially sharing his content for free will be willing to install
special software or specially format his information for the
hosting site. If the material comes in raw format,
considerable manpower must thus be devoted to making borrowed
material on the hosting site look as though it was specifically
[illegible] for the site. This effort is naturally compounded
when the material comes from a range of web content
providers. Further, there is likely to be some lag between the time
that the web content is available on the content provider's
web page and its appearance on the hosting site. This dilutes
the desired appearance of the hosting site having the latest
and greatest material.

In reality, the hosting site is unlikely to find many partners
without some convincing demonstration that its reuse of the
material will somehow benefit the original content provider
[illegible] endanger his revenue stream.
[illegible] present invention solves this important problem.

SUMMARY OF THE INVENTION

It is an object of the invention to reduce the expense and
effort of providing content in a new hosting web site or to
update the content of a web content provider web site.

It is another object of the invention to reduce the effort
needed to develop a filter for extracting desired content
elements from a set of web pages.

It is another object of the invention to reuse content from
[illegible] of content providers some of which may
[illegible].

It is another object of the invention to adapt content from
other web sites to the appearance of the hosting web site so
[illegible] sites appears native
[illegible] hosting web site.

It is another object of the invention to automatically
update material on the hosting web site as it changes on
content provider web sites.

It is another object of the invention to reuse web content
in a plurality of hosting site web pages each with a
respective appearance.

It is another object of the invention to reuse web-based
content without requiring a content provider web site to
modify content or install special purpose software.

It is another object of this invention to enable a publisher
of an electronic document to control the reformatting of the
document by a hosting site.

These objects and others are accomplished by an
automated means for defining a filter used to extract web content
for a web page wherein the extracted content is used in a
recast web page. The recast web page may be produced by
a hosting site, or may be part of an effort to revise a web site
at a web content provider. First, a set of pages, possibly a
single page, is retrieved from a content provider web server.
Next, the web page is parsed to identify a set of selectable
content elements. Next, a representation of the original web
page is presented in a user interface, wherein the selectable
content elements are demarcated. The user will select some
of the elements for inclusion in the filter through the user
interface, whereby the tool will indicate the selected content
elements for inclusion in the filter. The tool constructs the
filter so that when the filter is used, the selected content
elements are extracted from a retrieved web page from the
content provider web server and reused in the recast web
page. As part of the process of identifying the selectable
content elements, a set of varied headers can be used to
retrieve multiple versions of the same web page. In this way,
the multiple versions of the web page are compared to
identify static and dynamic content elements and marked as
static or dynamic.

The filter finds particular application in a distribution
mechanism for managing content on the World Wide Web by
means of a filtering and formatting service located on a
hosting server. The invention provides an automated system
for recasting web content from a web content provider web
site in the context of a hosting web site. At the hosting web
site, it brokers a client browser's request for a web page,
analyzes the returned content and splits it into component
elements, extracts the desired component elements, recasts
the desired elements in the look and feel of the hosting site
and sends the recast content to the requesting client as a web
page. Once the reformatted file is received at the client, the
client browser interprets the HTML in the web page,
presenting the content in the context of the hosting web site. On
the content provider's web site, the details of the transaction
in the web server logs are preserved, proxying a direct page
view and ad impression.

The foregoing has outlined some of the more pertinent
objects and features of the present invention. These objects
should be construed to be merely illustrative of some of the
more prominent features and applications of the invention.
Many other beneficial results can be attained by applying the
disclosed invention in a different manner or modifying the
invention as will be described. Accordingly, other objects
and a fuller understanding of the invention may be had by
referring to the following.

BRIEF DESCRIPTION OF THE DRAWINGS

For a more complete understanding of the present
invention and the advantages thereof, reference should be made to
the following Detailed Description of the Preferred
Embodiment taken in connection with the accompanying drawings
in which:

FIG. 1 is a representative system in which the present
invention is implemented.

FIG. 2 is a simplified block diagram of a requesting client,
hosting server and plurality of content provider servers
which illustrates an overview of the process of the present
invention.

FIG. 3 is an illustrative example of an unchanged source
web page as it would normally be presented by a client
browser as retrieved from the content provider web server.

FIG. 4 is an illustrative example of the reformatted web
page as presented at the client browser after having
undergone the processing of the present invention.

FIGS. 5A and 5B are more detailed flowcharts of a
preferred method of the processes which occur at the hosting
server.

FIG. 6 is a pictorial representation of a hosting filter
definition interface.

FIG. 7 is a block diagram of the major components of the
data processing system unit on which the invention may be
practiced.

FIG. 8 is a diagram of a user interface based filter creation
[illegible] invention.

FIG. 9 is a diagram of policy based pass through
distribution.

FIG. 10 is a flow diagram of one preferred embodiment of
policy based passthrough distribution.

DETAILED DESCRIPTION OF THE DRAWINGS

A representative system in which the present invention is
implemented is illustrated in FIG. 1. A plurality of Internet
client machines 10 are connectable to a computer network
Internet Service Provider (ISP) 12 via a network such as a
dialup telephone network 14. As is well known, the dialup
[illegible] network usually has a given, limited number of
connections 16a-16n. ISP 12 interfaces the client
machines 10 to the remainder of the network 18, which
includes the hosting server 19 and a plurality of web content
[illegible] machines 20. A client machine typically
[illegible] Web
[illegible] to access the servers of the network and thus
obtain certain services. These services include one-to-one
messaging (e-mail), one-to-many messaging (bulletin
[illegible]), [illegible] and browsing. Various
[illegible] protocols are used for these services. Thus,
for example, browsing is effected using [illegible]
Protocol (HTTP), which provides users access to
multimedia files using Hypertext Markup Language (HTML).
[illegible] that use HTTP comprise the World
[illegible] the Internet's multimedia information
retrieval system.

As shown in FIG. 2, the invention is a method [illegible]
for extracting Web-based content, especially, but not limited
[illegible] Web-based news articles, from content provider or source
web sites for use by the hosting or "pass-through" Web site.
These articles typically are revenue generating [illegible]
publisher by carrying advertising banners above and/or
below the article text. Therefore the publishers must benefit
[illegible] arrangement provided by the hosting site to be
interested in licensing their content for a low or no fee. As
explained below, the web content provider maintains his ad
revenue as the number of "hits" on the advertisements are
[illegible] in a transparent manner. As the articles are also
posted at the hosting site, ad revenues can actually increase
[illegible] impressions are being solicited from two sites
rather than one.

During configuration, the pass through publisher 101 at
[illegible] provided with the URLs 105 for the
[illegible] provider web servers 107 and a set of filters
109 for the content publisher's document templates 111. For
ease in illustration, a single client 113 and a single web
content server 107 are depicted. However, the reader should
understand that a plurality of clients and web content servers
are typically interconnected through the agency of the
hosting site. Upon a request 115 from a client 113 for a given
web page, typically made through an HTTP request from the
resident browser, the process for providing a page using the
pass through mechanism begins. Next, after having
established that the requested page originates at the web content
server 107, the hosting site makes a request 117 for the page.
Presuming that this is a first request for the web page or that
a more up to date version of the page is available at the web
content provider than is cached locally, the page is returned
119. In today's web technology, the web page is typically an
HTML file with references to the component .wav, .mov, .gif
and JPEG files which together make up the web page as
perceived by the user. Secondary components such as
cascading style sheets and Java applets [illegible]
accommodated by the invention. The list above is merely
exemplary; any component on a web page can be extracted and
recast into the hosting site context by the present invention.

Next, the pass through publisher 101 retrieves the filter
definitions and policies from the filter database 109 for this
particular content provider web site. Using the filters and the
retrieved HTML page, the pass through publisher 101 parses
the HTML source for desired components of the page.
Typically, this is the title of the article, the ad banner or
banners and the article text itself, although other items on the
page are potentially desirable. These pieces of content are
then recast into a new web page by means of an HTML
template 121 that matches the look and feel of the hosting
Web site. The new page includes the graphics of the hosting
provider as well as the navigational features of the hosting
site. This page is then sent 123 to the client 113 for
presentation by the browser. In a typical web interaction
between browser and server, once the browser receives the
HTML page, it issues additional requests for the component
files such as .gifs, e.g., ad banners. For the ad banners
themselves, the new page preserves the call 125 back to the
content provider so that the correct advertising content is
presented. It is common that each request of a web page
from a server can be refreshed with a different
advertisement.
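
By way of illustration only, the parse and recast steps just described can be sketched in a few lines of Python. The sketch assumes that a filter definition is simply a pair of begin and end marker strings for each desired component; the marker strings, the names `extract_between` and `recast`, and the sample page and template are hypothetical and are not taken from the specification.

```python
from string import Template


def extract_between(html, begin, end):
    """Return the text between the first `begin` marker and the next `end`
    marker, or None if either marker is missing."""
    start = html.find(begin)
    if start == -1:
        return None
    start += len(begin)
    stop = html.find(end, start)
    if stop == -1:
        return None
    return html[start:stop].strip()


def recast(html, filter_def, template):
    """Extract each component named in `filter_def` and inject it into the
    hosting site's template. Missing components become empty strings."""
    parts = {}
    for name, (begin, end) in filter_def.items():
        value = extract_between(html, begin, end)
        parts[name] = value if value is not None else ""
    return Template(template).safe_substitute(parts)


# Hypothetical source page, filter definition and hosting template.
SOURCE = (
    "<html><h1>Big News</h1>"
    "<!--ad--><img src='http://ads.example/b.gif'><!--/ad-->"
    "<!--body-->Article text here.<!--/body--></html>"
)
FILTER = {
    "title": ("<h1>", "</h1>"),
    "ad": ("<!--ad-->", "<!--/ad-->"),
    "article": ("<!--body-->", "<!--/body-->"),
}
HOST_TEMPLATE = (
    "<html><div class='host-nav'>Host</div>"
    "<h2>$title</h2><p>$article</p>$ad</html>"
)

page = recast(SOURCE, FILTER, HOST_TEMPLATE)
```

Note that the ad banner reference is carried over unchanged, so the browser's later request for the image still goes to the content provider, which is the behavior the text attributes to call 125.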

In this way, the end user receives a page with graphic and
navigation features from the hosting Web site that has an
embedded article from the publisher and an advertisement
served off of the publisher's site. The final result is content
viewed by the end user in the host site's native Web context,
with an ad banner served from the original publisher, thereby
preserving their revenue stream.

It should also be noted that the article text is preferably
cached [illegible] on the hosting Web server 103,
for faster and guaranteed access in the event that the
publisher's Web site becomes inaccessible. The invention
encompasses several variations in the types of information
parsed from the page and cached locally. Some of this
information may be incorporated in the recast HTML page
and some may be used for version checking. For example,
information in the HTML header such as "last modified",
"content length" and "content type" could be kept with the
article text so that the copy in the cache can be compared to
the version available at the content provider site. However,
in the preferred embodiment, the applicants have found it to
be more efficient to simply compare the "last modified" data
in the HTML header with the "last modified" data in the
hosting system's cache file. Remember that the hosting site
103 makes the request 117 for the client to preserve the
accounting data for the content provider web site 107. Since
the header data is among the first to be transmitted 119 in
response, after a simple compare establishes that the cached
version and the version currently available at the content
provider web site are the same, the transmission 119 from
the content provider can be ended. The hosting system 103
then uses the cached copy of the article. In the event of no
response from the content provider web site, a cached copy
of the article is used. When there is no cached copy of an
article, or the compare establishes that a more recent version
of the article is available, the entire transmission 119 from
the content provider is received for processing.
Alternatively, rather than waiting for a client request, the
'freshness' of the cached content can be ascertained by
automatically generating HTTP requests from the cached
URLs and monitoring data in the HTTP headers when the
page is hit in the background, updating the cache any time
the web content provider changes their data.
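
The version check described above can be sketched as follows, purely as an illustration. The sketch assumes that the "last modified" data is the standard HTTP Last-Modified header value and that the cache entry stores the same string; the function names, the dictionary layout of the cache entry and the return labels are hypothetical. No network access is performed; the response headers are passed in, with None standing for no response from the content provider.

```python
from email.utils import parsedate_to_datetime


def cache_is_fresh(cached_last_modified, response_headers):
    """True when the provider's Last-Modified value is not newer than the
    value stored with the cached article."""
    remote = response_headers.get("Last-Modified")
    if cached_last_modified is None or remote is None:
        return False
    return parsedate_to_datetime(remote) <= parsedate_to_datetime(cached_last_modified)


def choose_source(cached_entry, response_headers):
    """Decide where the article should come from.

    cached_entry: dict with a 'last_modified' string, or None if not cached.
    response_headers: dict of headers from the provider, or None if the
    provider did not respond.
    """
    if response_headers is None:
        # No response from the content provider: fall back to the cache.
        return "cache" if cached_entry else "unavailable"
    if cached_entry and cache_is_fresh(cached_entry["last_modified"], response_headers):
        # Same version: the rest of the transmission can be abandoned.
        return "cache"
    # No cached copy, or a more recent version exists: receive it in full.
    return "fetch"


entry = {"last_modified": "Wed, 21 Oct 2015 07:28:00 GMT", "article": "text"}
same = choose_source(entry, {"Last-Modified": "Wed, 21 Oct 2015 07:28:00 GMT"})
newer = choose_source(entry, {"Last-Modified": "Thu, 22 Oct 2015 07:28:00 GMT"})
```

The request to the content provider is still issued in every case, since the text relies on that request to preserve the provider's accounting data; only the body transfer is skipped when the cached copy is current.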

The aim of caching pass-through web content is to
maximize efficiency by minimizing network bandwidth
requirements while preserving the transparency of the
transaction. By caching copies of the parsed content on the
hosting server, serving the content to the end user directly
and simulating their 'hit' on the publisher's site in the
background, the end user gets content directly from the hosting
site without having to wait for data to travel from the content
web provider's site to the hosting site. However, this method
only assures a correct count for the web content provider
whose advertising systems use a secondary HTTP request
for the image retrieval to generate the ad impression. For
systems that rely on dynamic HTML generation to log ad
impressions, the ad content must be retrieved for each user
and not cached on the host site. The static portion of the
page, i.e. the article, however, can be cached, since it
remains the same for each visit at least for a relatively long
period of time. Serving the recast page to the end user will
be delayed by the network for retrieving the ad content, but
if the publisher's site becomes unavailable, the end user will
not be affected.

An alternative embodiment of the invention is to provide
a client based Java applet that retrieves dynamic content
from the web content provider [illegible]
user's browser [illegible]
the hosting site's cache [illegible]
the Java applet [illegible].
This reduces [illegible] the host site for
dynamic HTML ad generation.

Before describing the hosting process in greater detail, the 
reader's attention is directed to FIGS. 3 and 4 which 
respectively show the appearance of a content provider web 
page as originally sent and the recast web page as sent from 
the hosting site. It should be understood that the page in FIG. 
3 is never actually displayed by the client browser, however, 
showing the page as it would have been presented if the 
client had made the request directly to the content provider 
web site is useful to understand the principles of the inven- 
tion. 

As shown in both figures, the browser window 201
bounds each web page and contains standard graphical user
interface elements such as title bars, menu items and scroll
bars. The browser shown is Netscape Communicator,
showing that a standard client browser can be used unmodified to
practice the invention. In the client area 203 showing the
unmodified page, the logo banner 205, title area 207 and
article text 209 are shown. Under the logo banner 205, a set
of links 211 will retrieve other pages from the content
provider server. Finally, at the bottom of the page, an ad
banner 213 is presented.

In FIG. 4, the recast page is shown in client area 303. In
this example, the logo banner 305 is preserved, but moved
to a new location (centered). The title area 307 and article
text 309 have changed location, font and font size and line
length. Other format changes are possible. Some, but not all
of the links 311 to other content provider web pages have
been preserved according to the policy for the web content
provider. Since these links may be important to the web
content provider to generate additional hits for other
advertising revenue, the provider may wish to institute a policy
that at least some of these links will be preserved in the
recast page. The ad banner 313 appears at the bottom of the
page. Note also that navigational features 315 and 317 native
to the hosting server have been added to the page. A
background border 319 giving the hosting web site a
distinctive look and feel has also been added. Of course, those
skilled in the art will recognize that the examples of "desired
content" are merely exemplary. The example of the top ad,
article and bottom ad is common to many web news articles.
The invention allows the hosting site to extract and recast
any number or type of desired content elements from the
web content provider page.

Depending upon the policy for the web content provider,
variations in which elements are preserved in the recast page
are possible. For example, the logo 305 is an optional
feature. It may be removed or reduced in size or replaced by
a different logo stored in the filter definition. The links 311
are optional; they could be removed, reformatted or
relocated. As a technical matter, the ad banner 313 is optional,
however, from a practical standpoint to obtain content at a
low licensing fee, they are probably mandatory. Other items
such as copyright notices are not shown in the figure, but
could be preserved.

The process by which a new page is registered into the
hosting system is depicted in FIG. 5A. It begins in step 401,
[illegible]
detected. Step 403 determines whether the page is from an
existing account, i.e. an existing web content provider web
site. If not, a new account is started, step 405. The account
or folder is a convenient place to store filter definitions,
policies and any transaction information which pertains to a
particular content provider.

The test in step 407 determines whether it is a new page,
either because of a new URL or new version, which has
started the registration process. If it is not a new page, step
409 determines whether it is a request to create or change
a filter definition which has started the registration process.
For the purposes of this diagram, the policy for a content
provider is considered part of the filter definitions although
the information can certainly be kept in a separate file. The
process will exit in step 411 if there is no filter definition to
[illegible].

In step 413, it is determined whether there is a suitable
filter definition in the account folder for the content provider
for the new page. As most pages in a web site share a
common format and style, it is envisioned that a relatively
small set of filter definitions can be used for all of the pages
from a particular site. If there is no existing filter definition
suitable, in step 415, a new filter definition is created for the
page. There is more discussion on the creation of filter
definitions and policies below in connection with FIG. 6.

In step 417, the page, i.e. URL is associated with the
appropriate filter definition and in step 419 the appropriate
changes to the account, URL and filter definition files are
made. Optionally, the new page can be processed and cached
as part of registration. Thus, in step 421, the filter definition
is used by the pass through publisher to extract the desired
portions of the page. In step 423, these portions of the page
are cached for retrieval in the event of a client request. The
process ends, step 425.

In FIG. 5B, the process for parsing and reusing web
content by the pass through publisher is shown. When a
client requests a new document from the pass through
publisher at the hosting web site, the requesting web client
information is recorded, and a request is made by the hosting
web site to the content provider's web server on behalf of the
requesting web client. The HTTP request to the web content
provider server is similar to that which the requesting client
[illegible] content provider site directly, except with
[illegible] hosting site as the originator. This assures that the web
[illegible] server's log files record a visit by the requesting
[illegible] preserving the content provider's
revenue stream.

As mentioned above, the hosting site preferably caches
content likely to be requested by a client to improve the
speed and reliability of the hosting web site pages. In this
way, if the document has not changed since the pass through
publisher last polled the site, it is retrieved from the local
cache after registering the "hit" on the remote server. This
reduces Internet bandwidth requirements and improves
performance on both the hosting web server and the web
content provider server.

However, for the process depicted in FIG. 5B, new
[illegible] been retrieved from the web content provider,
step 451. Once the document content has been
[illegible] the filter database is
searched for the appropriate filter [illegible], step [illegible]. The
definition kept for the web content provider [illegible]
information in the filter definition will help the pass through
publisher parse the document structure of the web page,
[illegible] the desired information. In step 457, a test is
performed to determine whether the parsing was a success.

If a filter definition for the page or web content provider
is not found, or the first attempt using the associated filter
[illegible] the pass through publisher can
[illegible]
present the reformatted content, however, the process will
not be as efficient as through an existing filter definition.
[illegible] approach utilizes several methods,
including [illegible] for [illegible] references to advertising engines.
As discussed below, the publisher can also look for a set
of embedded tags indicating the desired content. Any
document [illegible] a filter can not be found for can be logged
[illegible] create appropriate filter definitions. In
time, however, hosting sites employing the pass through
technique will be able to define templates appropriate to all
[illegible] content. Most content provider sites employ a
standard look and feel in their documents, allowing for
filters that are appropriate for large numbers of documents
found on a particular web site, if not every document on the
entire provider web site.

The excerpted components are then run through the
pass-through publisher's "post-processing" system to assure
that they do not contain "dangerous" formatting code
fragments [illegible] adversely affect the hosting web site, step
461. For example, when articles are extracted from within a
TABLE structure, HTML TABLE fragments could be left in
the filtered HTML that could destroy formatting on the
hosting web site. As another example, interactive or browser
dependent scripting code could be found in the filtered
HTML that may not make sense in the document's new
context. The post filtering tasks should also include fixing
any relative URLs embedded in the original web page to
preserve their original function. Optionally, this can be
accomplished by pointing the URLs to the hosting site for
handling. For example, many documents are split into
several pages by the web publisher. The link to the next part
of the article can be translated to a hosting site link so that
the next part is automatically served in the hosting site's
context. The relative link could also be translated to an absolute link so that it will still lead to the content provider server even when selected in the recast page. As would be readily understood by those skilled in the art, these post filtering tasks could easily be performed by one of the filters, however, the applicants have found it to be convenient to separate the tasks thus simplifying the construction of the filter definitions.

The component HTML file, once extracted, separated, and post filtered is then reformatted into a new document in the style and context of the hosting web site, step 463. This is done by another component of the pass through publisher, a web publishing application that creates a "dynamic publishing template". The web publisher injects the excerpted content, titles, copyright statements and logos as received from the post filtering process. In step 465, the desired components are cached, which may include components useful in determining the version of a web page, but are not used in the recast page. In step 467, the recast page is sent to the requesting client. The process ends, step 469. Once presented by the requesting browser, the content of the hosting web site appears seamless to the user, although it may originate at a plurality of web content provider sites as well as the hosting site itself.

Since the code from the original content has been abstracted and separated from its style and formatting, it is now possible to format before sending it to the user in any of a variety of styles. This can prove useful in a variety of situations. It is common for the web sites of several smaller organizations to be "hosted" by an organization with the technical expertise and capital equipment allowing the smaller organizations to concentrate on creating the content for the web sites rather than the details of maintenance of the server machines. A single pass through publisher could provide a different look and feel for each of the different organizations hosted on its web servers. Alternatively, a single hosting web site could provide several different alternative formats. The choice of which format to present to a particular user could be based on the organization or location associated with the user. Alternatively, the web site could allow the user to choose from among the different formats based on a registration of his preferences in a profile. Thus, the look and feel of a web site can change dependent upon the requesting audience.

The invention provides a mechanism which allows a hosting web site to provide a wide variety and great amount of third party Web content without incurring high licensing costs. Another benefit of the pass through system is in cost savings. Unlike a traditional system of licensing and republishing content, the hosting system does not require a large production staff since the republishing and re-styling of the content is automatic. A hosting system can provide a much faster production cycle and assure that the content does not quickly go "out of date".

A discussion of filter definition creation follows. The collection of document filters help the pass through engine understand the structure of a wide variety of web documents. The document filters can be created through several methods, including the analysis of the HTML source code, imbedded comments or delimiters and through comparisons with similar documents. Once the style of the web site is understood, a filter can be developed to look for the portion of the original document in which the hosting site is interested in reformatting. Inconsistencies in document style or structure can be neutralized by the use of custom code imbedded in the web page and detailed in the filter definition.

A CGI or other program can be used to create filter definition files. FIG. 6 shows a user interface in which tags or text can be entered manually so that the pass through publisher can more easily parse a web content provider's web pages. In the browser window 501, client area 503 contains a plurality of controls for a set of desired components. Entry fields 505, 507, 509, 511, 513, 515, 517, 519 and 521 are respectively used to enter the filter name, the logo name, a copyright string, a beginning of the top banner ad, the ending of the top banner ad, the beginning of the article text, the ending of the article text, the beginning of the bottom ad and the ending of the bottom ad. Note that certain entries, e.g., logo name and copyright string, could be replacements for those which occur in the web page, rather than indicators of the desired content. [...] indicate which of these items he wishes to keep on the recast [...] indicate whether [...] formatting should be stripped from certain areas of the [...] in field 527. Field 529 allows the entry of custom code for filtering code behaviors outside the predefined filters. Special cases can be accommodated by adding a function in Perl, Java, JavaScript or a specialized filter language. Push button 531 allows the user to change to a different filter definition.

Each filter definition is stored in a filter definition database accessible by the pass through publisher. The publisher uses the filter definition to break the content into component parts: the title area, primary and secondary advertisements and the content itself. The title area includes the title of the page [...]. Primary and secondary advertisements usually occur at the [...] web page but may be located at [...] locations. They are typically marked in the HTML by tags or comments indicating an advertisement. Depending [...] cross-publishing agreement with [...] content provider, i.e. allowing for republishing certain types of web content but not others [...] filter the content [...]. The filter may strip out any extraneous links or "sidebars" of information. Alternatively [...] origin web page.

In addition to providing the system with information on separating the components of the document, filter definitions include publisher specific information such as the logo, copyright statements and policies that should be used by the publisher when formatting the new version of the document.

Alternatively, the logo and copyright statements could be excerpted components like the title, ads and content. The filter definitions can include the policies for a particular web content provider. Any number of policies can be established based on publisher, article, article section or any other distinguishing criteria that can be identified. Policies might govern whether content is licensed for use on an intranet, but not on the Internet, or vice versa, or both; how many times a document may be served off a host site; whether the publisher's ads should be passed through or not; what kind of caching strategy should be applied; what cost each view of the article carries for the host site; and so on. The specific types of policies available will depend on the context in which pass-through is being used, whether as a commercial product, integrated into custom solutions, or bundled with other products.
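
The kinds of policies listed above can be pictured as a small record consulted before a document is served. The following is only a minimal sketch under stated assumptions: the class name, the field names and the `may_serve` helper are hypothetical and are not taken from the specification, which does not prescribe a policy format at this point.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Policy:
    """Hypothetical policy record; field names are illustrative only."""
    allow_intranet: bool = True          # licensed for use on an intranet
    allow_internet: bool = True          # licensed for use on the Internet
    max_serves: Optional[int] = None     # None means no limit on serves
    pass_publisher_ads: bool = True      # pass the publisher's ads through
    cost_per_view: float = 0.0           # cost of each view to the host site


def may_serve(policy: Policy, on_intranet: bool, serves_so_far: int) -> bool:
    """Return True if the policy permits serving the document once more."""
    allowed_network = policy.allow_intranet if on_intranet else policy.allow_internet
    under_limit = policy.max_serves is None or serves_so_far < policy.max_serves
    return allowed_network and under_limit
```

A hosting site might call `may_serve` before recasting a page and fall back to a link to the content provider when it returns False; that fallback is likewise an assumption.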
The client machine may be a personal computer such as a desktop or notebook computer, e.g., an IBM or IBM-compatible machine running under the OS/2® operating system, an IBM ThinkPad® machine, or some other Intel x86 or Pentium®-based computer running Windows '95 (or the like) operating system. Of course, the invention may be run on a variety of computers or collection of computers under a number of different operating systems. The computers on which the client software and the hosting and content provider web site reside could be, for example, a personal computer, a mini computer, mainframe computer or a hand held computer. Although the specific choice of computer is limited only by processor speed and disk storage requirements, it is typical that the client computer be somewhat lighter weight than the web server computers. For example, computers in the IBM PC series of computers could be used as clients in the present invention. One operating system which an IBM personal computer may run is IBM's OS/2 Warp 4.0. For the web servers, the computer system might be in the IBM RISC System/6000 (TM) line of computers which run on the AIX (TM) operating system.

In FIG. 7, a computer 710, comprising a system unit 711, a keyboard 712, a mouse 713 and a display 714 are depicted in block diagram form. The system unit 711 includes a system bus or plurality of system buses 721 to which various components are coupled and by which communication between the various components is accomplished. The microprocessor 722 is connected to the system bus 721 and is supported by read only memory (ROM) 723 and random access memory (RAM) 724 also connected to system bus 721. A microprocessor in the IBM PC series of computers is one of the Intel family of microprocessors including the 386, 486 or Pentium microprocessors. However, other microprocessors including, but not limited to, Motorola's family of microprocessors such as the 68000, 68020 or the 68030 microprocessors and various Reduced Instruction Set Computer (RISC) microprocessors such as the PowerPC chip manufactured by IBM might be used by the present invention. Other RISC chips made by Hewlett Packard, Sun, Motorola and others may be used in the specific computer.

The ROM 723 contains among other code the Basic Input-Output system (BIOS) which controls basic hardware operations such as the interaction of the processor with the disk drives and the keyboard. The RAM 724 is the main memory into which the operating system and application programs are loaded. The memory management chip 725 is connected to the system bus 721 and controls direct memory access operations including, passing data between the RAM 724 and hard disk drive 726 and floppy disk drive 727. The CD ROM drive 732 also coupled to the system bus 721 is used to store a large program or amount of data, e.g., a multimedia program or presentation.

Also connected to this system bus 721 are various I/O controllers: The keyboard controller 728, the mouse controller 729, the video controller 730, and the audio controller 731. As might be expected, the keyboard controller 728 provides the hardware interface for the keyboard 712, the mouse controller 729 provides the hardware interface for mouse 713, the video controller 730 is the hardware interface for the display 714, and the audio controller 731 is the hardware interface for the speakers 715. An I/O controller 740 such as a Token Ring Adapter enables communication over a network 746 to other similarly configured data processing systems.

One of the preferred implementations of the invention is as sets of instructions 748-752 resident in the random access memory 724 of one or more computer systems configured generally as described above. Until required by the computer system, the set of instructions may be stored in another computer readable memory, for example, in the hard disk drive 726, or in a removable memory such as an optical disk for eventual use in the CD-ROM 732 or in a floppy disk for eventual use in the floppy disk drive 727. Further, the set of instructions can be stored in the memory of another computer and transmitted in a transmission means such as a local area network or a wide area network such as the Internet when desired by the user. One skilled in the art knows that storage or transmission of the computer program product changes the medium electrically, magnetically, or chemically so that the medium carries computer readable information.

Further, the invention is often described in terms that could be associated with a human operator. While the operations performed may be in response to user input, no action by a human operator is desirable in any of the operations described herein which form part of the present invention; the operations are machine operations processing electrical signals to generate other electrical signals.

As used herein, "Web client" should be broadly construed to mean any computer or component thereof directly or indirectly connected or connectable in any known or later-developed manner to a computer network, such as the Internet. The term "Web server" should also be broadly construed to mean a computer, computer platform, an adjunct to a computer or platform, or any component thereof. Of course, a "client" should be broadly construed to mean one who requests or gets the file, and "server" is the entity which downloads the file. Moreover, although the present invention is described in the context of the Hypertext Markup Language (HTML), those of ordinary skill in the art will appreciate that the invention is applicable to alternative markup languages including, without limitation, SGML (Standard Generalized Markup Language), dynamic HTML and XML (Extended Markup Language).

Moreover, while the preferred embodiment is illustrated in the context of a dialup network and the Internet, this is not a limitation of the present invention. The invention can also be implemented in an intranet environment where a large organization may have several content provider units which provide content for content using units which target different customer segments and have different trade identities. Thus, while the content using units may utilize much of the same information, each will want to recast the information in a different look and feel to project their own trade dress.

Filter Definition

As mentioned in incorporated by reference application, Ser. No. 09/113,678, entitled "Distribution Mechanism For Filtering, Formatting and Reuse of Web Based Content", there are many possible approaches to parsing filters for the invention. The invention discussed in this section is concerned with the automated creation of filter definitions for the distribution mechanism for a given set of Web pages.

For predictable sets of documents, a number of approaches are possible and more or less straight forward. It is possible for a user to generate a filter by coding or to use a program such as the CGI program discussed above with reference to FIG. 6. However, either approach requires sufficient research into the "typical" web page style or format at a given site. Given that a typical web page can contain 100 kilobytes of information, performing this manual comparison can be quite difficult, time consuming and error prone. Further, because the typical web site undergoes a continual renewal, changing content each day, manual comparison is probably not terribly practical in most
situations. The present invention provides an alternative for filter definition for the distribution mechanism.

An overall diagram of the document parsing and filter creation process is shown in FIG. 9. The first step to creating a filter for a set of documents is to point the parsing engine 801 of the filter creation application at a representative member 803 of the type of documents for which the filter definition is intended through use of its URL. It is possible to generate a filter definition from a single page or from a collection of pages. Generating the filter from a single selected page is the preferred embodiment for a web site or section of a web site which is written in a consistent style. However, since a single page may not always be representative of the site or section as a whole, in some cases, it is preferred to use several pages to derive the filter definition. Because the hosting site is not generally interested in hosting all of the pages from the content provider web site, some sort of constraint is used to select the pages used to derive the filter. The constraint can be accomplished by indicating to the parsing engine to select a certain number of pages, e.g., five, from a particular path, e.g., www.news.com/news/*, or manually selecting several pages from the appropriate sections of the web site.

Alternatively, a starting page and certain number of traversals from the page can be given to select a group of pages so long as the traversal stays within the pages controlled by the web site. For example, the user gives the URL www.ibm.com/products/Aptiva and a traversal range of 3 to the parsing engine. This means that the parsing engine will select the initial page and pages on the www.ibm.com web site which are within three hyperlinks from the starting page to create the filter definition.

The parsing engine or its filter agent then retrieves from the URL a number of pages, varying the headers such as user-agent in its request each time, and stores the results of each retrieval. Even when using a single page to derive the filter, the headers are varied to reflect a representative sample of the types of browsers that would be expected to visit the source site. The purpose is to determine if any of the content on the page is created dynamically according to the type of client visiting the site. For example, content may change according to the manufacturer or version of the user's browser or its capabilities; advertising content may be selected based on domain; message text may vary according to time zone, IP address or other user environment information contained in the request headers. With a stored set of the contents of the URL, the parsing engine then compares each member of the set with the rest of the set, looking for differences and similarities. Similarities include portions of text, graphics or other content that do not vary each time the page or pages load. The similarities will be identified, as will portions of the page and pages that do vary in some way, i.e. the differences.

With these comparisons, the parsing engine 801 builds a topographical map of the data, which results in a description of the document's static and dynamic components. An article on a typical Internet news site would be an example of a static piece of content. It remains the same no matter who visits the site. Ad banners on the same type of site are the obvious example of dynamic content. Each visitor gets a different ad banner according to the advertiser's contract with the publisher.

In narrative form, the topographical map would read something like this: The document at URL X starts with static component A followed by a dynamic component B that ends at a static component C. This is followed by a further dynamic component D that ends at static component E. The description continues to the end of the document.

One skilled in the art would recognize that there are a number of ways in which this data could be organized. One preferred data structure is given below.

Preferred data structure for topographical map:

Use XML tags to delineate begin and end of each 'chunk', embedded in the HTML source of a representative page, in the following form:

<PASSTHROUGH:CHUNK NAME="[chunk number]" TYPE="[DYNAMIC|STATIC]"> [html content] </PASSTHROUGH:CHUNK>

where [chunk number] is a number in a sequence that is incremented with the discovery of chunks in the page, i.e. 1, 2, etc., and TYPE is either DYNAMIC or STATIC.

Note that this CHUNK tag will be replaced by a PASSTHROUGH XML tag that identifies the actual content of the chunk, once the user is done creating the filter ... i.e. chunk 1 may become top_ad_banner.

This information is sent to the user interface builder 803 to present a representation of the page from which the user can select identified components to be passed through the filter and republished through the distribution mechanism.

The data of the static and dynamic components is then used to display a version of a representative document to the user. The actual presentation of the user interface could be performed at the hosting server, but would more typically be performed at a client machine associated with the hosting server. One of the documents used to derive the information can be arbitrarily selected for presentation. Preferably, the interface looks visually identical to the actual document, but is in fact a clickable diagram of the structure of the document. Borders can be presented around each element prior to and/or after selection to aid the user in understanding the boundaries of the component. The placement of the identified elements and the borders is a straightforward task given the information present in the HTML of the page. Once the element is placed, a slight offset around the element can be used to place the border.

The interface determines whether each mouse click is located in one of the static or dynamic components of the document. If a mouse click is detected in an identified element, a border is drawn around that component, visually identifying it for the user. If there are already borders drawn, the selected element can be highlighted, e.g., change background or border color, in some manner. The user then assigns a label to that element such as "Article_text" or "top_ad_banner" or the like, and the element label or definition is then recorded in the database. A pop-up menu of element labels can be presented for this purpose. The user continues defining elements until he or she is done. Unused chunk information is discarded as irrelevant to the context of this filter.

Preferably, the label which the user used to identify the component is the identifier used by the template of the pass through distribution mechanism at the hosting site. As discussed above, the template is used to tell the pass through engine how the components of source pages are to be recast into the hosting site's pages. Alternatively, however, an additional step is possible wherein the user associates the label of the page element with the identifier used by the distribution mechanism.

As part of the parsing process, the parsing engine examines the tags associated with each of the elements. These tags tend to be common, even between page layouts which appear dramatically different. Thus, the parser 801 can
provide a best guess as to what the appropriate label for the
component would be. This can be accomplished by reference
to a table in the parser which has a mapping of common
HTML tags to element labels. This information can be
passed to the user interface module 803. When a component
is selected, the best guess label is presented to the user. This
can be done by highlighting the best guess label in the
pop-up menu of component labels described above.
However, those skilled in the art would recognize that the
contents of each identified component can differ greatly and
that it may not be possible to meaningfully associate a set of
tags with each type of component. Thus, the "best guess" of
the filter definition parser must, in the opinion of the
inventors, be augmented by a human user, which is, of
course, the point of having a user interface.
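
The table lookup behind the "best guess" can be sketched as follows. This is an illustration under stated assumptions, not the parser's actual table: the patterns, the labels chosen for them and the function name `best_guess_label` are hypothetical, since the specification says only that a mapping of common HTML tags to element labels exists.

```python
import re

# Hypothetical mapping of common HTML markers to element labels.
# The entries are illustrative; the specification does not list them.
TAG_LABEL_TABLE = [
    (re.compile(r"<title\b", re.IGNORECASE), "Title"),
    (re.compile(r"<!--\s*ad", re.IGNORECASE), "top_ad_banner"),
    (re.compile(r"<p\b", re.IGNORECASE), "Article_text"),
]


def best_guess_label(chunk_html, default=None):
    """Return the label of the first table entry whose pattern occurs in the chunk."""
    for pattern, label in TAG_LABEL_TABLE:
        if pattern.search(chunk_html):
            return label
    return default  # no guess; the user must pick a label unaided
```

The returned label would only preselect an entry in the pop-up menu; the user remains free to choose another.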

Next, the user defines the publisher associated with this
URL and that is recorded in the database as well. Finally, the
user specifies the pages or the URL pattern for which the
filter should be used. By default, the filter will be used for
the page or pages which were used to identify the component.
However, the user may wish the filter to be used for a
large section, if not all, of the documents from the web site.
If the URL for the selected page is edited to remove the
document portion or part of the path, a larger set of documents
will be processed with the filter. One preferred means
of indicating that the filter is to be used by any page from a
given domain name is the use of a wildcard character. For
example, www.domain1.com/news/*, where * is a wildcard
character, would indicate that any web page from the news
section of the www.domain1.com domain was to be filtered
using the just defined filter.
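
Matching a requested URL against such a pattern can be sketched in a few lines. The sketch assumes that the wildcard appears only as the final character of the pattern, which is how the example above uses it; the function name `filter_applies` is hypothetical.

```python
def filter_applies(url_pattern, url):
    """Return True if url falls within the scope given by url_pattern.

    Assumption: '*' is only meaningful as the last character, where it
    matches any remainder of the URL. Without it, an exact match is required.
    """
    if url_pattern.endswith("*"):
        return url.startswith(url_pattern[:-1])
    return url == url_pattern
```

For instance, `filter_applies("www.domain1.com/news/*", "www.domain1.com/news/today.html")` is True, while a URL under `www.domain1.com/sports/` does not match.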

Once the filter definition and the scope of its use are
defined, the filter definition is stored in a filter database 805
with other filter definitions to be used with pages from other
domains.
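
The comparison of several retrievals of one page, which underlies the static and dynamic components that the user labels, can be sketched with a line-based diff. This is a minimal sketch under stated assumptions: it compares only two retrievals, works line by line, and uses Python's `difflib`; the function names are hypothetical and the specification does not name a comparison algorithm.

```python
from difflib import SequenceMatcher


def map_chunks(version_a, version_b):
    """Compare two retrievals of one page; return (TYPE, lines) chunks of version_a."""
    a, b = version_a.splitlines(), version_b.splitlines()
    chunks = []
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    for op, a1, a2, _b1, _b2 in matcher.get_opcodes():
        if a1 == a2:
            continue  # lines present only in version_b; nothing to mark in version_a
        kind = "STATIC" if op == "equal" else "DYNAMIC"
        chunks.append((kind, a[a1:a2]))
    return chunks


def tag_chunks(chunks):
    """Wrap each chunk in a numbered PASSTHROUGH:CHUNK tag."""
    out = []
    for number, (kind, lines) in enumerate(chunks, start=1):
        body = "\n".join(lines)
        out.append(
            f'<PASSTHROUGH:CHUNK NAME="{number}" TYPE="{kind}">'
            f"{body}</PASSTHROUGH:CHUNK>"
        )
    return "\n".join(out)
```

Given two retrievals that differ only in an ad image, the result is a STATIC chunk, a DYNAMIC chunk holding the ad, and a second STATIC chunk. A fuller version would fold in every stored retrieval rather than two.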

Now that the process has been described, certain details of
the data schema used in the preferred embodiment are
discussed below.

In one preferred data schema, the fields for storing a filter
in the pass-through database would be as follows:

URL — the URL of the document to be passed through. This
would normally be a partial URL that would match a
number of documents, i.e. http://www.publisher.com/news/articles*
would match any URL that started with
that string, such as
http://www.publisher.com/news/articles/ecommerce.html

CHUNK — each filter would contain an arbitrary number of
component or element records, the structure of which is
defined below.

PUBLISHER ID — this field would contain a numeric or 
alpha id that would refer to the publisher's entry in a 
separate database. The schema for this database is below. 

POLICY ID — a publisher can have more than one policy
associated with it, so it is necessary for the filter to
associate itself with a publisher ID and policy ID as well.
It would be a numeric id that pointed to a policy in the
policy database.

Chunk (Element) Definitions
Each chunk would contain the following fields:

LABEL — the identifier for the element such as "Top_Banner_Ad"
or "Article_Text" to be used when assembling the
final pass-through page.

START — the static data that signifies the beginning of this
particular element.

START_N — the nth occurrence of START. For example, if
START is <IMG SRC="/graphics/divider.gif"> and
START_N is 3, the chunk would start with the 3rd
occurrence of the START text found in the document. The
default value is 1.

KEEP_START — is a Boolean field to instruct the pass
through mechanism to keep or discard START text as part
of the component.

END — is the static data that signifies end of component.

END_N — nth occurrence of END is relevant as it identifies
the instance of the landmark to use as the end of the
element. For instance, there may be several end of table
cell tags (</TD>) in a given component, and the fifth such
tag encountered should be considered the end of the
component in question.

KEEP_END — is a Boolean to keep or discard END text as 
part of the component. 

SPECIAL — contains the name of custom post-processing 
code to which to feed the component. 

The SPECIAL field exists to provide for future adaptation
or unforeseen difficulties. Suppose, for example, that a source
site begins using other XML tags in its HTML pages that are
being passed through, and that those XML tags end up in
extracted components and interfere with the layout of the
final assembled page. A SPECIAL field could then be used to
add a Perl or Java filter that strips those specific XML tags
from the extracted components.
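
How the chunk fields drive extraction can be sketched as below. This is a minimal sketch, not the pass through mechanism's implementation: the `Chunk` class and both functions are hypothetical, the SPECIAL field is omitted, and one assumption is made that the text leaves open, namely that occurrences of END are counted from just after the chosen START.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    """Hypothetical in-memory form of one chunk (element) record."""
    label: str
    start: str
    end: str
    start_n: int = 1          # START_N, default 1 as stated above
    end_n: int = 1            # END_N; default of 1 is an assumption
    keep_start: bool = False  # KEEP_START
    keep_end: bool = False    # KEEP_END


def find_nth(text, needle, n, begin=0):
    """Index of the n-th occurrence of needle at or after begin, or -1."""
    pos = begin - 1
    for _ in range(n):
        pos = text.find(needle, pos + 1)
        if pos == -1:
            return -1
    return pos


def extract(chunk, html):
    """Return the component delimited by the chunk's landmarks, or None."""
    s = find_nth(html, chunk.start, chunk.start_n)
    if s == -1:
        return None
    body_from = s + len(chunk.start)
    # Assumption: END occurrences are counted from just after START.
    e = find_nth(html, chunk.end, chunk.end_n, body_from)
    if e == -1:
        return None
    first = s if chunk.keep_start else body_from
    last = e + len(chunk.end) if chunk.keep_end else e
    return html[first:last]
```

With START `<hr>`, START_N 3, END `</td>` and END_N 2, the component begins after the third `<hr>` and stops before the second `</td>` that follows it.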

In one preferred embodiment, each entry in the Publisher
Database would contain the following fields:

ID         unique numeric identifier of the filter.
NAME       Publisher's business name.
URL        URL of the publisher's site.
LOGO       binary file to be included on pages passed
           through from the publisher's site and linked
           with the URL.
CONTACT    name, phone, and e-mail address of contact
           person for pass through.
POLICY     numeric ID of policy set that pertains to this
           publisher's content in Policy Database.

Although the filter definition process was explained in the
context of the overall pass through mechanism, the filter
definition is useful in other circumstances where web content
is filtered, allowing selected components of the filtered
web pages to be reused for other purposes. For example, a
similar filter can be used as a means to rejuvenate the look and
feel of a web site to a new format. The selected content from
old pages would be recast into new pages, all at the web
content provider's web site and distributed directly through
calls to the native web site, rather than through the pass
through distribution mechanism.
Policy in the Distribution Mechanism 

All the various components of the pass-through mechanism
are preferably tied together and overseen by sets of
rules or policies defined for or by each publisher or web
content provider. In one preferred embodiment, these policies
are kept at the hosting server in a 'publisher database'
which represents a collection of information regarding every
aspect of the data sources as they pertain to the pass through
distribution mechanism.

Understandably, as the content developed by the web 
content provider represents a great deal of intellectual 
capital, the provider will be interested in specifying control 
of the pages. Through the use of the policy database, the 
hosting server can comply with policies specified by the web content provider as to who is allowed to see what when. For example, the web content provider may wish to restrict republication of his data according to the origin of the requesting client. Some publishers will allow all of their content to be disseminated widely throughout the Internet so long as they receive advertising income from the ad hit counter. Other publishers will wish to restrict distribution of the web content within an intranet or extranet or to certain IP addresses. Further, it is possible to specify which portions of the database can be seen by a specific class of user. For example, the web content provider may allow text in an article to be recast and redistributed by the pass through mechanism to anyone requesting it in the Internet. However, the provider may wish to restrict a graphic to a certain specified group of users because of the effort required to produce it. Yet further, the web content provider may be willing to allow the hosting site to recast its content at a certain time lag, thus preserving an advantage for presenting the very latest material at its own site. While this may not be preferred from the standpoint of the hosting site, still the delayed content is better than no content whatsoever, and certainly at a much lower expenditure of time and effort than writing a page from scratch. Still further, a web content provider may have multiple policies for the web site, one policy for a respective set of web pages and a second policy for a second respective set of pages.

The mechanisms by which the invention enforces these and other policies are described below. As shown in FIG. 9, a web page 901 resident in a web content provider web server 903 has been requested by the hosting server 905 as the result of an HTTP request by a web client 907. In the diagram, a special XML tag 909 is included in the HTML which makes up the page which specifies or otherwise identifies the appropriate policy. This XML tag 909 defines the boundaries of the data for which the policy applies, the identity of the data and may also contain policy data for the data. The policy data can include information such as cost of recasting/redistribution, access privileges and copyright information. Alternatively, it may simply be a policy ID for the appropriate policy in the policy database 911 coupled to the hosting server 905. Although only a single tag is shown in the figure, multiple tags can be present, each specifying a respective, possibly different, policy for the web page data it respectively protects.

Once the web page 901 is retrieved by the pass through agent 911 at the hosting server 905, the XML tag is identified

contained in the XML tag, the policy data in the XML tag overrides the general policy for the tagged data. This allows [...] content provider finer control of particularly [...] permitting a general policy which [...] majority of its data. The policy data from the XML tag may be stored in the policy database for archival, accounting or other purposes. For example, the hosting site may be contractually required to show the web content provider that the recast web content was distributed according to the policies set by the web content provider.

[...] requesting client is permitted the web content, the [...] Namely, the pass through mechanism excerpts the desired information from the web page and recasts the information according to the preferred format of the hosting site.

In the recasting process, the policy data in the XML tag may be consulted. It could potentially contain instructions [...] how to format the tagged data. For example, when an up-to-date version of a graphic or article can not be displayed, the policy may have instructions for the hosting [...] request or using a previous cached version, a back level version of the graphic or article, [...] insert a link in the recast page together with text indicating that "This information is delayed by three hours. For a more up-to-date version, please click here." The link could bring the user to the web content provider web site.

In an alternative embodiment to that described above, a finer degree of control for respective components of the web page is possible without the use of the special XML tags. However, for a given page, a certain amount of coordination between the filter definition and the policy in their specific databases is required. As noted above, the filter definition preferably contains a reference to the appropriate policy for [...] section of the web site from which the web page originated. The policy retrieved may specify different treatments [...] the respective selected components of the filter definition. [...] component can be said to have its own policy. [...] a specific component on a specific page needs special treatment, a filter definition for that specific page can [...] developed and used in the pass through distribution process. Alternatively, the filter can reference several policies and indicate which apply to respective selected com-

through the parsing process. The data boundaries, data ID ponents. 

and policy data are extracted and are used to assemble the The publisher database in one preferred embodiment 
recasted page. If a policy ID is specified, the corresponding actually comprises two databases: one with basic publisher 
policy is retrieved from the policy database. Alternatively, if 50 data, another that defines policies that pertain to the pub- 
no poUcy is specified in the tag, the URL from which the Usher's content. In this embodiment, the schema for the 
web page was retrieved is used to retrieve the appropriate databases look like this: 
policy. The policy data, both from the tag and from the , 
policy in the policy database, is used to determine whether Information 
the hosting site has permission to recast the web page to the 55 

requesting client The client specific data which is included 

in the client request such as IP address is matched against the id unique numeric identifier 

policy for web data or publisher. Other types of client name publisher's company or dba name 

specific data include client operating system, browser manu- site 

facturer and version, browser capabiUties, e.g., JavaScript, 60 COhHACr con^ person information (Name, phone number. 

Stylesheets, domain and the referer document which indi- poucies list of policies by id that pertains to the 

cates the source URL from which the link originated. If the publisher's content. 

client specific data is not contained in the initial request, the COPYRIGETT text to append to each page passed through 

hosting server can make a query to the client for the needed to image to place on passed 

, ° , . . ^ through pages, blank if no logo u to be 

data, e.g., authentication. 65 displayed. 

If there is a conflict between the general policy for the — — 
web site stored in the policy database and the policy data 
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Policy Records 



ID unique numeric id 

URL (portion of URL to match, eg., 

hitp'J /news.web&bead.ibm.oom/news) 
NUMBER e.g., -1 for no limit, 0 or more for quota 



OF PAGE VIEWS 

CACHE TYPE 0 no cache, 1 cache entire document, 3 cache 
nanaed chunks 

CACHE CHU>fKS names of fields that should be cached, 

i.e. "BODY" or "ARTICLE" 
DISTRIBUTE 0 single or specified sites, 1 intranet, 2 

extranet, 3 universal 15 
COST PER VIEW in dollars, 0 for no accoimting charge 
MIN AGE invalidates request unless document is at least a 

given age old. Attribute would direct hosting site to 

alternate content, e.g., <MIN_AGE-5_JI0URS, ALT- 

"oldversion.html"> 20 
ATTRIBUTION Includes link, logo and attribution text to 

be added at the bottom of a recast web page or selected 

component. 
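
As a minimal sketch, a policy record of the kind listed above could be held in memory as follows. The Python field names, types and defaults are assumptions made for illustration; only the field meanings come from the schema, and the patent does not prescribe a storage format.

```python
from dataclasses import dataclass, field
from typing import List, Optional

NO_LIMIT = -1  # NUMBER OF PAGE VIEWS value meaning "no quota"


@dataclass
class PolicyRecord:
    """Illustrative in-memory form of one policy record (names are assumed)."""
    id: int
    url: str                          # portion of URL to match
    page_views: int = NO_LIMIT        # -1 for no limit, 0 or more for a quota
    cache_type: int = 0               # 0 no cache, 1 entire document, 3 named chunks
    cache_chunks: List[str] = field(default_factory=list)  # e.g. ["BODY"]
    distribute: int = 0               # 0 specified sites, 1 intranet, 2 extranet, 3 universal
    cost_per_view: float = 0.0        # in dollars, 0 for no accounting charge
    min_age_hours: Optional[int] = None  # None means no minimum age
    alt: Optional[str] = None         # alternate content, e.g. "oldversion.html"
    attribution: str = ""             # link, logo and attribution text


def has_quota(policy: PolicyRecord) -> bool:
    """True when the policy limits the number of page views."""
    return policy.page_views != NO_LIMIT
```

A record built with only an ID and a URL portion then behaves like an unrestricted policy: no quota, no caching and no charge.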

Those skilled in the art will appreciate that other schemas are
possible for storing the policy information. For example, the
schema above assumes that a single policy will suffice for all
the elements in a page which are retrieved by a single HTTP
request. This will generally work since the elements which
are likely to require different policies, text vs. graphics, are
generally called by separate HTTP calls and so can have
separate policy IDs. However, in the alternative, rather than
have the policy IDs associated with a given URL, they can
be associated with a specific content element to specify
different policies for respective components of a given web
page.

In one preferred embodiment, if a filter definition did not
call out a specific policy ID, then a default policy definition
would be used. This definition, rather than ascribing to the
special preferences of a particular web content provider,
would follow the needs of the hosting site. Generally, it
would have no limit on the number of views of a page a user
could request, nor would it have any limitation on the type of
requesting client who could receive the page. Caching
would be performed as was most efficient for the hosting site
to give the best apparent speed to the requesting user.

The flow diagram depicted in FIG. 10 illustrates the
process discussed above. In the illustrated process, it is
assumed that the client has made a request to the hosting site
for a pass through document, i.e. a web page from a web
content provider rather than a native page stored at the hosting
site. In step 1001, the web page is retrieved by the hosting
site. Each request for a pass through document will cause a
search in the filter and policy databases for records which
match the URL or main portion thereof of the retrieved web
page. When such records are found, they are retrieved, step
1003. As discussed above, the filter definition and policy
contain information such as the publisher's id, the filter id
and the policy id and associated data for the web page. The
filter definition is used to parse the web page for the selected
components which will be recast by the hosting site. In
addition, the parsing step 1005 looks for the special XML
tag discussed above.

If the XML tag is found, step 1007, additional processing
occurs. In step 1009, the tag is used to identify the affected
component within the web page and the boundaries of the
component. In step 1011, the policy data associated with the
affected component is extracted and processed. In step 1013,
any policy differences between the policy detailed in the tag
and the policy definition from the policy database are
resolved. In the preferred embodiment, they are typically
resolved as specified by the tag. In step 1015, any special
formatting data in the tag is extracted for future use.
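
The resolution in step 1013 can be sketched as a simple merge in which, where both sources define a value, the value from the tag wins. This is only an illustration of the "resolved as specified by the tag" behavior; representing policies as dictionaries and the key names used below are assumptions.

```python
from typing import Any, Dict


def resolve_policy(db_policy: Dict[str, Any], tag_policy: Dict[str, Any]) -> Dict[str, Any]:
    """Merge two policies; where both define a field, the tag's value wins.

    Neither input is modified. Fields present in only one policy are kept.
    """
    resolved = dict(db_policy)   # start from the policy database definition
    resolved.update(tag_policy)  # tag values override on conflict
    return resolved
```

For instance, a database policy that allows caching of the entire document combined with a tag that sets caching to 0 yields a resolved policy with no caching for the tagged component.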

The data in the policy definition is matched against the 
client specific information associated with the client request, 
step 1017, and is used to test to see if the pass through
distribution should be allowed. In step 1019, a test is 
performed to determine whether there is enough client 
specific data to determine whether the client request should 
be fulfilled. If not, in step 1021, the client is queried for the 
needed data. In step 1023, the client returns the client 
specific data required. The test in step 1025 determines
whether the requesting client should be permitted access, i.e. 
whether the pass through distribution should take place. If 
so, in step 1027, the selected components are extracted from 
the web page according to the filter definition, the extracted 
information is formatted according to the hosting site's 
template and any additional formatting information in the 
tag or policy definition. In step 1029, the recast web page is 
sent to the requesting client. 
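
Steps 1017 through 1029 can be sketched as follows. This is a hedged illustration rather than the patented implementation: the policy layout (a "required" list of client fields and an "allowed" map from field to permitted values) and the two callbacks are hypothetical names introduced for the example.

```python
from typing import Callable, Dict, List, Optional


def handle_request(
    policy: Dict,
    client: Dict[str, str],
    query_client: Callable[[List[str]], Dict[str, str]],
    extract_and_format: Callable[[], str],
) -> Optional[str]:
    """Return the recast page, or None when pass through is not permitted."""
    # Step 1019: is there enough client specific data to decide?
    missing = [name for name in policy.get("required", []) if name not in client]
    if missing:
        # Steps 1021 and 1023: query the client and merge the returned data.
        client = {**client, **query_client(missing)}

    # Step 1025: every restricted field must hold a permitted value.
    allowed = all(
        client.get(name) in permitted
        for name, permitted in policy.get("allowed", {}).items()
    )
    if not allowed:
        return None

    # Steps 1027 and 1029: extract, format and return the recast page.
    return extract_and_format()
```

A caller would pass a function that performs the filter-based extraction and template formatting as `extract_and_format`, so the access decision stays separate from the recasting itself.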

The logging step 1031 can vary greatly in complexity 
depending on the particular implementation of the invention 
and the policy associated with the web page. As mentioned 
above, there may be a cost associated with distributing the
web page. Thus, there needs to be a log of the transactions
which can be associated with particular requesting clients, or
the hosting site itself, so that these fees can be assessed
accurately. Also, as noted above in the policy definition
schema, there may be a limited number of times that
particular web data may be viewed through the pass through 
mechanism before the requesting client is requested to 
access the data directly from the web content provider's web 
site. Logging the number of times that a client has requested 
the data facilitates an additional test, e.g., as part of step 
1025, to determine whether the client can receive the data 
through the pass through mechanism. 
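
A minimal sketch of such a log is given below, assuming views are counted per client and URL and that -1 means no limit, as in the policy schema. The class and method names are hypothetical.

```python
from collections import Counter


class ViewLog:
    """Counts pass through views per (client, URL) pair."""

    def __init__(self) -> None:
        self._views: Counter = Counter()

    def record(self, client_id: str, url: str) -> None:
        """Log one pass through transaction."""
        self._views[(client_id, url)] += 1

    def within_quota(self, client_id: str, url: str, page_views: int) -> bool:
        """Additional test for step 1025: may this client still view the data?"""
        if page_views == -1:  # -1 means no limit
            return True
        return self._views[(client_id, url)] < page_views

    def charges(self, client_id: str, cost_per_view: float) -> float:
        """Fees owed by one client at a flat cost per view."""
        total = sum(n for (cid, _), n in self._views.items() if cid == client_id)
        return total * cost_per_view
```

Once the quota is exhausted, the hosting site can direct the client to the web content provider's own site instead of serving the recast page.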

As shown in the policy definition schema, the policy for
a given web content provider may specify whether the data
from the web content provider's site can be cached at the
hosting site. As mentioned above, caching at the hosting site
greatly improves the perceived performance of the pass 
through distribution mechanism. However, some web con- 
tent providers may not wish the hosting site to cache their 
data. The policy can also specify specific caching policies as 
to how long and what type of data may be kept in the cache 
for the particular publisher. Thus, if specified by the caching 
policy, certain components of the web page may be cached 
in the logging step. 

As mentioned above, it is possible that the publisher will 
have multiple policies for specific sections of the web site. 
Preferably, the sections will be organized such that a URL
can be used to select the correct policy. For example, the
news section, e.g., www.domain1.com/news, will be passed
through, but the product section, www.domain1.com/products,
will not be. The portion of the URL after the main portion,
which specifies the actual page, is ignored in
the matching step. However, some sites may not be well
organized and it could be of potential interest to log the
policy and filter definition used with each transaction as a
means for taking future corrective action. If an XML tag was
defined and embedded in the page to specify which policy
should be used, a tag ID can be logged as well.
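
The matching step can be sketched as a prefix match on the main portion of the URL. Choosing the longest matching portion when several match is an assumption of this sketch, as are the function names; the text only states that the page-specific remainder of the URL is ignored.

```python
from typing import Dict, Optional


def strip_scheme(url: str) -> str:
    """Drop a leading scheme such as 'http://' if one is present."""
    return url.split("://", 1)[-1]


def select_policy(url: str, policies: Dict[str, int]) -> Optional[int]:
    """Return the policy ID whose URL portion prefixes `url`, else None.

    `policies` maps a URL portion (e.g. 'www.domain1.com/news') to a policy ID.
    """
    target = strip_scheme(url)
    best = None  # (matched portion, policy id)
    for portion, policy_id in policies.items():
        prefix = strip_scheme(portion).rstrip("/")
        # Match whole path segments only, so '/news' does not match '/newsroom'.
        if target == prefix or target.startswith(prefix + "/"):
            if best is None or len(prefix) > len(best[0]):
                best = (prefix, policy_id)
    return best[1] if best else None
```

With only the news section registered, a request for a page under www.domain1.com/products finds no policy and is therefore not passed through.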

The pass through mechanism can be configured as a stand
alone server software product. This would resemble a proxy
server and would serve two purposes: it would help the
speed issue by devoting more resources to the hosting
activity, and it would allow the servicing of several hosting
web sites from a single server.

The invention solves several business and technical
problems. It provides an attractive mechanism to obtain
permission to reprint Web-based content with little or no licensing
fees. Since the original publisher's transaction records are
preserved, their existing revenue base is maintained through
the number of ad impressions counted. Since the ad
impressions are now also occurring on the hosting web site with
very little work on the part of the original publisher, the
revenue is very likely to be increased. Thus, increased traffic
is generated for both the hosting web site as well as the
content provider's site with very little manual intervention
after configuration.

The invention is very flexible and is easily configured to
accommodate a wide variety of web content. Through the
use of document templates, filters and policies, the invention
allows simple modification of these elements to tailor them
to any number of different content providers' formats and
document templates. Once the hosting web server has been
configured for a set of content providers, the production staff
necessary to republish articles is minimal. Content can be
extracted without the content provider web site modifying
content to a special format or installing special purpose
software. Articles in the hosting web site are automatically
synchronized with those in the content provider as changes
are made at the content provider web site (so long as
noncached material is used). By abstracting the content from
the particular content provider site and reformatting the
content, a consistent look and feel
is maintained.

In one preferred embodiment of the invention, the hosting
web server caches content locally to speed delivery to the
requesting client and minimize dependency on the content
provider web site. In other embodiments of the invention,
unauthorized requests are blocked, eliminating a potential
avenue for abuse of the system and copyright violation.

In the attached appendix, examples are given of a content
provider's original web page, the template into which the
hosting site inserts the excerpted desired content and the
resulting recast page with comments. These examples will
help the reader more fully understand the principles of the
present invention.



APPENDIX 



Original Content Provider HTML:
IBM Global Services

<http://www.ibm.com/services/articles/whatwedo.html>What we can do for you
<http://www.ibm.com/services/business/>Viewpoints
<http://www.ibm.com/services/career/>Careers
<http://www.ibm.com/services/business/feature.html>Case Studies
<http://www.ibm.com/services/pressrel/>News
<http://www.ibm.com/services/navtools/otherservices.html>
<http://www.ibm.com/Search>Search
<http://www.ibm.com/services/profservices/index.html>Professional Services
<http://www.as.ibm.com/>Product Support Services
<http://www.ibm.com/globalnetwork/>Network Services
<http://www.ibm.com/services/ourportfolio.html>Our Portfolio

IBM Announces New e-business Services for Security

Builds on popular packaged e-business services offerings

March 24, 1998

BOSTON, Massachusetts, March 24, 1998 . . . IBM today announced new global
security services that build on the company's portfolio of e-business services
introduced last October. IBM's e-business offerings help business use networks
and Internet technologies to more securely buy and sell on the Web and improve
internal and external communication. IBM made these announcements at Internet
Commerce Expo.

<../ebus/security.html>IBM Security Services help customers of all sizes
assess and improve security in their computing environments. They address
exposures across operations, including policy and management systems,
applications, networks, systems and physical site security. IBM has the unique
capability as a security services provider to give customers a choice of
individual offerings or a comprehensive, end-to-end security solution.

*IBM is a registered trademark of International Business Machines Corporation
<http://www.ibm.com/> IBM Homepage <http://www.ibm.com/Orders/> Order
<http://www.ibm.com/Assist/> Contact IBM
<http://www.ibm.com/IBM/Employment> Employment
<http://www.ibm.com/Privacy/> Privacy <http://www.ibm.com/Legal/> Legal

The Hosting Site Web Page Template
Home

<http://dev2.cross-site.com/apps/top.map> Need Help? Click on the '?'
<http://dev2.cross-site.com/apps/side.map> Need Help? Click on the '?'
<http://dev2.cross-site.com/cs/?section=News&text=news/news.html>News |
<http://E.dejanews.com/crosssite/>Forums |
<http://dev2.cross-site.com/cs/?section=Columns&text=columns/columns.html>Columns |
<http://dev2.cross-site.com/cs/?section=Resources&text=resources/resources.html>Resources |
<http://dev2.cross-site.com/cs/?section=Downloads&text=downloads/downloads.html>Downloads |
<http://dev2.cross-site.com/cs/?section=Cross-Site&text=about/about.html>About |
<http://dev2.cross-site.com/cs/?section=Products&text=products/products.html&sidebar=products/sidebar.html>Products |
<http://dev2.cross-site.com/cs/?section=Employment&text=employment/employment.html>Employment
<http://dev2.cross-site.com/cs/?sidebar=home/sidebar.html>Home |
<http://dev2.cross-site.com/cs/?section=Search&text=sitesearch/search.html&title=Search&logo=logo.crosssite>Search |
<http://dev2.cross-site.com/cs/?section=Mail&text=mail/mail.html>Email |
<http://dev2.cross-site.com/cs/?section=Contact&text=about/contact.html>Contact |
<http://dev2.cross-site.com/cs/?section=Help&text=support/help.html>Help

(C)1998 Tivoli Systems

The Recast Web Page (including comments):

(The parsing engine extracted this code from the URL):

<IMG SRC="http://www.ibm.com/services/images/animh.gif" alt="IBM Global
Services" WIDTH=584 HEIGHT=54 BORDER=0><br>
<TABLE WIDTH=584 CELLSPACING=0 CELLPADDING=0 BORDER=0>
<TR><TD><NOBR><A
HREF="http://www.ibm.com/services/articles/whatwedo.html"
TARGET=_top><IMG SRC="http://www.ibm.com/services/images/foryou3.gif"
ALT="What we can do for you" WIDTH=145 HEIGHT=18 BORDER=0></A><A
HREF="http://www.ibm.com/services/business/" TARGET=_top><IMG
SRC="http://www.ibm.com/services/images/viewpt3.gif" ALT="Viewpoints"
WIDTH=81 HEIGHT=18 BORDER=0></A><A
HREF="http://www.ibm.com/services/career/" TARGET=_top><IMG
SRC="http://www.ibm.com/services/images/careers3.gif" ALT="Careers"
WIDTH=67 HEIGHT=18 BORDER=0></A><A
HREF="http://www.ibm.com/services/business/feature.html" TARGET=_top><IMG
SRC="http://www.ibm.com/services/images/casestdy3.gif" ALT="Case Studies"
WIDTH=90 HEIGHT=18 BORDER=0></A><A
HREF="http://www.ibm.com/services/pressrel/" TARGET=_top><IMG
SRC="http://www.ibm.com/services/images/news3.gif" ALT="News" WIDTH=52
HEIGHT=18 BORDER=0></A><A
HREF="http://www.ibm.com/services/navtools/otherservices.html"><IMG
SRC="http://www.ibm.com/services/images/countrysites.gif" WIDTH=87
HEIGHT=18 BORDER=0></A><A HREF="http://www.ibm.com/Search"
TARGET=_top><IMG
SRC="http://www.ibm.com/services/images/search3.gif" ALT="Search"
BORDER=0></A></NOBR></TD></TR>
</TABLE>

(It then inserted the code into the hosting site's template, thusly:)

<CENTER>
<TABLE BORDER=0>
<TR>
<TD>
<IMG SRC="http://www.ibm.com/services/images/animh.gif" alt="IBM Global
Services" WIDTH=584 HEIGHT=54 BORDER=0><br>
<TABLE WIDTH=584 CELLSPACING=0 CELLPADDING=0 BORDER=0>
<TR><TD><NOBR><A
HREF="http://www.ibm.com/services/articles/whatwedo.html"
TARGET=_top><IMG SRC="http://www.ibm.com/services/images/foryou3.gif"
ALT="What we can do for you" WIDTH=145 HEIGHT=18 BORDER=0></A><A
HREF="http://www.ibm.com/services/business/" TARGET=_top><IMG
SRC="http://www.ibm.com/services/images/viewpt3.gif" ALT="Viewpoints"
WIDTH=81 HEIGHT=18 BORDER=0></A><A
HREF="http://www.ibm.com/services/career/" TARGET=_top><IMG
SRC="http://www.ibm.com/services/images/careers3.gif" ALT="Careers"
WIDTH=67 HEIGHT=18 BORDER=0></A><A
HREF="http://www.ibm.com/services/business/feature.html" TARGET=_top><IMG
SRC="http://www.ibm.com/services/images/casestdy3.gif" ALT="Case Studies"
WIDTH=90 HEIGHT=18 BORDER=0></A><A
HREF="http://www.ibm.com/services/pressrel/" TARGET=_top><IMG
SRC="http://www.ibm.com/services/images/news3.gif" ALT="News" WIDTH=52
HEIGHT=18 BORDER=0></A><A
HREF="http://www.ibm.com/services/navtools/otherservices.html"><IMG
SRC="http://www.ibm.com/services/images/countrysites.gif" WIDTH=87
HEIGHT=18 BORDER=0></A><A HREF="http://www.ibm.com/Search"
TARGET=_top><IMG
SRC="http://www.ibm.com/services/images/search3.gif" ALT="Search"
BORDER=0></A></NOBR></TD></TR>
</TABLE>
</TD>
</TR>
</TABLE>
</CENTER>
<A NAME="TOP"></A>
<FONT SIZE="+1" COLOR="#0000FF" FACE="Arial, Helvetica">
<B>News</B>
</FONT>
<!-- START TOP NAV BUTTONS -->
<TABLE CELLPADDING=0 CELLSPACING=0 BORDER=0 WIDTH=100%>
<TR ALIGN=RIGHT VALIGN=TOP>
<TD BGCOLOR=#FFCC33 ALIGN=RIGHT VALIGN=CENTER BORDER=0
WIDTH=100% COLSPAN=2>
<A HREF="http://dev2.cross-site.com/apps/top.map">
<IMG NAME="topbuttons" HEIGHT=35 WIDTH=175
SRC="http://dev2.cross-site.com/images/topbuttons.gif"
BORDER=0 ALT="Need Help? Click on the '?'" ISMAP></A>
</TD>
</TR>
<!-- END TOP NAV BUTTONS -->

(Similarly, the template has this insertion spot for the article from the
content provider's document:)
<TABLE BORDER=0>
<TR>
<TD>
</TD>
</TR>
</TABLE>

(Into which the extracted article is inserted:)
<H3>
IBM Announces New e-business Services for Security
<BR><SMALL>Builds on popular packaged e-business services
offerings </SMALL>
</H3>
<P><B>March 24, 1998</B></P>
<P>BOSTON, Massachusetts, March 24, 1998 . . . IBM today announced new global
security services that build on the company's portfolio of e-business
services introduced last October. IBM's e-business offerings help business
use networks and Internet technologies to more securely buy and sell on the
Web and improve internal and external communication. IBM made these
announcements at Internet Commerce Expo.
<p> <a href=../ebus/security.html>IBM Security Services </a> help
customers of all sizes assess and improve security in their computing
environments. They address exposures across operations, including policy
and management systems, applications, networks, systems and physical site
security. IBM has the unique capability as a security services provider to
give customers a choice of individual offerings or a comprehensive,
end-to-end security solution.
<BR>
</FONT>
</TD>
</TR>
</TABLE>

(The end result is a unified HTML document with elements from the 
publisher's page inserted into the host site's template to create a 
seamless whole.) 
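
The insertion shown in this appendix can be sketched in a few lines. The sketch assumes the template marks each insertion spot with a `$LABEL` placeholder handled by Python's `string.Template`; that placeholder syntax and the function name are illustrative choices, since the appendix shows only the resulting HTML.

```python
from string import Template
from typing import Dict


def recast(template_html: str, extracted: Dict[str, str]) -> str:
    """Insert extracted content elements into a hosting site template.

    `extracted` maps a label (e.g. 'ARTICLE') to the HTML extracted by the
    filter. Placeholders without a matching label are left unchanged.
    """
    return Template(template_html).safe_substitute(extracted)


# Hypothetical template with one insertion spot for the article.
TEMPLATE = "<TABLE BORDER=0>\n<TR>\n<TD>\n$ARTICLE\n</TD>\n</TR>\n</TABLE>"
```

Using `safe_substitute` rather than `substitute` means a missing content element leaves its placeholder in place instead of raising an error, which keeps a partially filled page inspectable.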



While the invention has been shown and described with
reference to particular embodiments thereof, it will be
understood by those skilled in the art that the invention can
be practiced, with modification, in other environments. For
example, although the invention described above can be
conveniently implemented in a general purpose computer
selectively reconfigured or activated by software, those
skilled in the art would recognize that the invention could be
carried out in hardware, in firmware or in any combination
of software, firmware or hardware including a special purpose
apparatus specifically designed to perform the
described invention. Therefore, changes in form and detail
may be made therein without departing from the spirit and
scope of the invention as set forth in the accompanying
claims.

We claim: 

1. A method for defining a filter used to extract web
content for a web page wherein the extracted content is used
in a recast web page produced by a hosting site, comprising
the steps of:

retrieving multiple versions of at least one original web
page from a content provider web server;

parsing the multiple versions of the original web page to 
identify a set of selectable content elements; 



comparing the multiple versions of the web page to 
identify static and dynamic content elements; 

presenting a representation of the original web page in a 
user interface, wherein the selectable content elements 
are demarcated and marked as either static or dynamic 
elements; 

responsive to user input, selecting content elements for 

inclusion in the filter; and 
constructing the filter so that the selected content elements 

are extracted from a retrieved web page from the 

content provider web server and reused in the recast 

web page. 

2. The method as recited in claim 1, wherein a plurality of 
web pages from the content provider web server are parsed 
to identify the set of selectable content elements. 

3. The method as recited in claim 1, wherein a set of 
varied headers are used to retrieve multiple versions of the 
same web page. 

4. The method as recited in claim 1, further comprising
the steps of:

associating a URL with the filter; and
using the filter to extract web content from web pages
from the associated URL.

5. The method as recited in claim 1, further comprising
the steps of:

associating a label with each respective selected content
element;

using the filter to extract selected content elements from
a web page from a web content provider web site;

using the associated labels to insert the selected content
elements into a web page template containing a hosting
web server format, thus creating the recast web page;
and

serving the recast web page to the client browser;

wherein the appearance of the recast page when presented
by the client browser is as though all elements
originated at the hosting web server.

6. The method as recited in claim 5, wherein one of the
desired content elements is an advertisement element from
the content provider web server, and the method further
comprises the step of inserting a call back to the content
provider web server for the advertising element.

7. The method as recited in claim 5, further comprising
the step of processing the desired content elements to
eliminate harmful code, prior to insertion in the web page
template.

8. A method for defining a filter used to extract web
content from a web page for reuse in a recast web page,
comprising the steps of:

parsing a web page to identify a set of selectable content
elements;

parsing multiple versions of the web page to identify
dynamic and static selectable content elements;

presenting a representation of the original web page in a 
user interface, wherein whether a given selectable 
content element is dynamic or static is indicated; 

responsive to user input, selecting content elements for 
inclusion in the filter; and 

constructing the filter so that the selected content elements 
are extracted from a retrieved web page from the web 
server and reused in the recast web page. 

9. The method as recited in claim 8, further comprising
the steps of: 

selecting at least one web page representative of a set of 

web pages on a web server; and 
including link data in the filter so that when one of the set 

of pages is called, the filter is used to extract selected 

content elements from the called page. 

10. The method as recited in claim 9, wherein a plurality
of filters are constructed for a web site on the web server,
each for a respective set of pages on the web site.

11. The method as recited in claim 9, wherein the link data
included in the filter is a URL having a wildcarded ending.

12. The method as recited in claim 8, further comprising the
steps of:

calling a set of web pages from a web server for a web 
site; 

using the filter to extract selected content elements from 

each of the set of web pages; 
using the extracted content elements to construct a new
set of web pages for the web site.

13. A method for defining a filter used to extract web
content from a web page for reuse in a recast web page,
comprising the steps of:

parsing a web page to identify a set of selectable content
elements;

presenting a representation of the original web page in a
user interface;

responsive to detecting selection of a content element,
presenting a pop-up of labels available for the selected
content element;

responsive to selection of one of the labels, associating the
label with the selected content element;

responsive to user input, selecting content elements for
inclusion in the filter; and

constructing the filter so that when the filter is used the
selected content elements are extracted from a retrieved
web page from the web server and reused in the recast
web page.

14. The method as recited in claim 12, further comprising 
the steps of: 

parsing data associated with each selectable content ele- 
ment; 

matching the parsed data to data in a table of available 
labels, each available label corresponding to respective 
web page data; and 

responsive to a match of the parsed data to data in the 
table, highlighting the corresponding label in the pop- 
up of labels. 

15. The method as recited in claim 8, further comprising 
the step of presenting a demarcation of each selectable 
element in the web page representation. 

16. The method as recited in claim 8, further comprising 
the steps of: 

determining client specific information about a client
browser from which a request originated;

selecting among a set of filters stored in a filter definition
database on a hosting server based on the client specific
information, wherein each of the filters extracts different
selected content elements from a web page; and

using the selected filter for creating a recast web page to 
be sent to the client browser. 

17. A system including processor and memory for defin- 
ing a filter used to extract web content from a web page for 
reuse in a recast web page, comprising: 

means for parsing a web page to identify a set of select- 
able content elements; 

means for parsing multiple versions of the web page to 
identify dynamic and static selectable content ele- 
ments; 

means for presenting a representation of the original web 
page in a user interface having user input sensitive 
areas corresponding to respective selectable content
elements, wherein whether a given selectable content 
element is dynamic or static is indicated; 

means responsive to user input for selecting content 
elements for inclusion in the filter; and 

means for constructing the filter so that the selected 
content elements are extracted from a retrieved web 
page from the web server and reused in the recast web 
page. 

18. The system as recited in claim 17, wherein the system 
is a hosting system further comprising: 

means for receiving requests from client browsers; 
means for retrieving web pages from web content pro- 
vider servers; 

means for using the filter to extract selected content 

elements in the retrieved pages; 
means for recasting the extracted content elements in 

recast pages; and 
means for sending the recast pages to the client browsers.

19. The system as recited in claim 18 further comprising:
means for storing constructed filters;

means for selecting a filter from the storing means; and
means for using the selected filter for extracting selected
content elements from the received web pages for
constructing recast web pages in a hosting server
format.

20. The method as recited in claim 17, further comprising:
means for selecting at least one web page representative
of a set of web pages on a web server; and
means for including link data in the filter; and
means for using the included link data so that when one
of the set of pages is retrieved responsive to a client
request, the filter is used to extract selected content
elements from the retrieved page.

21. The system as recited in claim 17, further comprising:
a store for a plurality of filters, wherein a set of the
plurality of filters is constructed for a content provider
web site on a web server, each filter for a respective set
of pages on the content provider web site
means for presenting a representation of the original web 
page in a user interface having user input sensitive 
areas corresponding to respective selectable content 
elements. 

22. A computer program product in a computer readable 
medium for defining a filter used to extract web content from 
a web page for reuse in a recast web page, comprising: 
means for parsing a web page to identify a set of select- 
able content elements; 
means for parsing multiple versions of the web page to 
identify dynamic and static selectable content elements; 

means for presenting a representation of the original web 
page in a user interface, wherein whether a given 
selectable content element is dynamic or static is indi- 
cated; 

means responsive to user input for selecting content 
elements for inclusion in the filter; and 

means for constructing the filter so that the selected 
content elements are extracted from a retrieved web 
page from the web server and reused in the recast web 
page. 

23. The product as recited in claim 22 further comprising 
means for linking each selected content element with a 
template for creating a recast web page. 

24. The product as recited in claim 23, further comprising: 
means for selecting at least one web page representative 
of a set of web pages on a web server; and 
means for including link data in the filter so that when one 
of the set of pages is called, the filter is used to extract 
selected content elements from the called page. 

25. The product as recited in claim 22, further comprising: 
means for retrieving a set of web pages from a web server 
for a web site; 

means for using the filter to extract selected content 
elements from each of the set of web pages; 

means for using the extracted content elements to con- 
struct a new set of web pages for the web site. 

26. A computer program product in a computer readable 
medium for defining a filter used to extract web content from 
a web page for reuse in a recast web page, comprising: 

means for parsing a web page to identify a set of select- 
able content elements; 

means for presenting a representation of the original web 
page in a user interface; 

means for presenting a set of labels available for the 
selected content element; 

means for associating selected labels with respective 
selected content elements; 

means responsive to user input for selecting content 
elements for inclusion in the filter; and 

means for constructing the filter so that the selected 
content elements are extracted from a retrieved web 
page from the web server and reused in the recast web 
page. 

27. The product as recited in claim 22, further comprising 
means for presenting a demarcation of each selectable 
element in the web page representation. 
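
For illustration only, the flow recited in claims 22 and 26 (parse a page into selectable content elements, compare multiple versions to mark each element static or dynamic, then construct a filter that extracts the selected elements) might be sketched as follows. This is a minimal sketch and not the disclosed implementation. It assumes that a selectable content element is any element carrying an `id` attribute, that the multiple versions are already available as strings, and that a filter is a plain dictionary of selected ids. The names `ElementCollector`, `parse_elements`, `classify`, `build_filter` and `apply_filter` are hypothetical.

```python
from html.parser import HTMLParser


class ElementCollector(HTMLParser):
    """Collect the text of every element that carries an id attribute.

    Assumption: an element with an id is a "selectable content element".
    """

    def __init__(self):
        super().__init__()
        self.stack = []      # (tag, id or None) for each open element
        self.elements = {}   # id -> accumulated text

    def handle_starttag(self, tag, attrs):
        elem_id = dict(attrs).get("id")
        self.stack.append((tag, elem_id))
        if elem_id is not None:
            self.elements[elem_id] = ""

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed tags.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        # Text belongs to every enclosing element that has an id.
        for _, elem_id in self.stack:
            if elem_id is not None:
                self.elements[elem_id] += data


def parse_elements(html):
    """Return {id: whitespace-normalized text} for one page."""
    collector = ElementCollector()
    collector.feed(html)
    collector.close()
    return {k: " ".join(v.split()) for k, v in collector.elements.items()}


def classify(versions):
    """Mark an element static if its text is identical in all versions."""
    parsed = [parse_elements(v) for v in versions]
    common = set(parsed[0]).intersection(*parsed[1:])
    return {
        eid: "static" if len({p[eid] for p in parsed}) == 1 else "dynamic"
        for eid in sorted(common)
    }


def build_filter(selected_ids):
    """Hypothetical filter format: the list of selected element ids."""
    return {"select": list(selected_ids)}


def apply_filter(filt, html):
    """Extract the selected elements from a newly retrieved page."""
    elements = parse_elements(html)
    return {eid: elements[eid] for eid in filt["select"] if eid in elements}


# Two versions of the same page, e.g. retrieved with varied headers.
v1 = '<div id="nav">Home | News</div><div id="story"><p>Rates rise</p></div>'
v2 = '<div id="nav">Home | News</div><div id="story"><p>Rates fall</p></div>'

marks = classify([v1, v2])
filt = build_filter(["story"])
extracted = apply_filter(filt, v2)
```

In this sketch the navigation block is marked static and the story block dynamic, and only the story text is extracted for reuse in a recast page. A practical tool would also need to handle elements without ids and link the extracted elements to a template.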